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Unbiased, label-free proteomics is becoming a powerful technique 
for measuring protein expression in almost any biological sample. The 
output of these measurements after preprocessing is a collection of 
features and their associated intensities for each sample. Subsets of 
features within the data are from the same peptide, subsets of pep- 
tides are from the same protein, and subsets of proteins are in the 
same biological pathways, therefore there is the potential for very 
complex and informative correlational structure inherent in these 
data. Recent attempts to utilize this data often focus on the identi- 
fication of single features that are associated with a particular phe- 
notype that is relevant to the experiment. However, to date there 
have been no published approaches that directly model what we 
know to be multiple different levels of correlation structure. Here we 
present a hierarchical Bayesian model which is specifically designed to 
model such correlation structure in unbiased, label-free proteomics. 
This model utilizes partial identification information from peptide 
sequencing and database lookup as well as the observed correlation 
in the data to appropriately compress features into latent proteins 
and to estimate their correlation structure. We demonstrate the ef- 
fectiveness of the model using artificial/benchmark data and in the 
context of a series of proteomics measurements of blood plasma from 
a collection of volunteers who were infected with two different strains 
of viral influenza. 



1. Introduction. Unbiased, label-free, mass spectrometry proteomics, 
sometimes called "shotgun" proteomics is a technique for measuring nearly 
all abundant proteins in a biological sample. Because of numerous tech- 
nical advances it is becomming increasingly robust and sensitive, leading 
to greater effectiveness for the study of biological and medical questions 
(Aebersold and Mann, 2003; Service, 2008; Ping, 2009). While early work in 
this field met with a number of notorious failures (Petricoin et al., 2002; 
Baggerly et al., 2004; Zhang and Chan, 2005) due to overlapping peaks, 
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batch effects and systematic noise, high accuracy spectrometers along with 
multiple fractionation techniques such as liquid chromatography and ion 
mobility have led to increased robustness as well as improved qualitative 
and quantitative results. 

After summarization, data generated by this technology is typically pre- 
sented as a p x n dimensional matrix of real valued intensities where the 
number of measured features p is typically orders of magnitude larger than 
n, as in microarray gene expression data. However, there are a number of im- 
portant characteristics that distinguish mass spectrometry proteomics from 
gene expression data. First, each feature is a short peptide that has been 
enzymatically cut out of a parent protein, and parent proteins typically give 
rise to many such peptides. Second, only the more abundant of these features 
are typically identified (meaning that the peptide sequence and originating 
protein are known). Third, features that are present at lower abundances will 
typically have numerous missing values across the samples. Finally, while the 
error rate for assigning identifications to features is low, it is not zero, and 
this leads to some peptides with incorrect identifications. 

Analysis approaches for these data can be performed at the feature level 
or at the protein level. The obvious consequence of performing analysis at 
the feature level is a significant loss of power due to the highly dependent 
nature of subsets of the features — particularly those that originate from the 
same protein. We prefer a dimension reduction approach in which groups of 
features are collected and summarized prior to analysis of associations be- 
tween features and biological phenotypes. There are a number of approaches 
to this in the literature, almost all of which rely entirely on the identified 
features in the data set. 

The simplest of these approaches involves direct summarization of all or 
some features that are identified for each protein either through averaging 
or robust summarization based on quantiles (Polpitiya et al., 2008). There 
are also a number of regression approaches which include fixed effects for 
protein, peptide and experimental group (Karpievitch et al., 2009), include 
an additional random effect for situations in which subjects are measured 
in replicate (Daly et al., 2008), or add additional interaction effects between 
treatment and features (Clough et al., 2009). These may assume constant 
or varying noise levels across isotope groups, and have been shown in some 
cases to exhibit better performance than naive summarization approaches 
that do not adjust for confounding factors (Clough et al., 2009). 

We are aware of only one approach to the analysis of these data that 
examines correlation structure between data features (Lucas et al., 2012). 
This approach utilizes a latent factor model to aggregate features, and uses 
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priors on the loadings that are informed by identifications. This leads to 
aggregation of multiple features into "metaproteins" . This is a sparse factor 
modeling approach where non-zero loadings for factor i are biased toward 
features that are identified as originating from protein i. While this approach 
allows the utilization of unidentified features in the data, it fails to account 
for correlation structure that arises when multiple proteins are involved in 
the same pathways. 

In this paper, we present an extension of Lucas et al. (2012) that explicitly 
models correlation structure between factors. We do this by incorporating 
a hierarchical structure on the latent metaproteins that allows borrowing 
strength between factors to estimate overall factor scores. We demonstrate 
improvements over both a generic sparse factor model Carvalho et al. (2008) 
and an earlier proteomics factor model Lucas et al. (2012), in terms of accu- 
racy of factor estimates and eventual association with biological phenotypes. 
Finally, we demonstrate the incorporation of known correlation structure in 
the form of repeated measures in our analysis of a viral challenge data set 
in Section 6. 

2. Motivating data. While the specifics of data generation may vary 
at different proteomics laboratories, the model we describe is appropriate 
for any high-accuracy mass spectrometry data. In general, the steps to data 
generation are as follows: (i) A biological sample is distilled to a solution 
containing those proteins that are of interest; (ii) The proteins in the sample 
are then broken up via trypsin; (iii) The processed sample is separated ac- 
cording to hydrophobicity using liquid chromatography. The time at which 
a particular constituent of the sample passes out of the chromatography 
column is called retention time; (iv) An electric charge is induced on the 
peptides; (v) The mass and intensity of these ions is measured in a mass 
analyzer. The intensity and ion masses are measured at a regular interval, 
called sampling rate and the resulting measurements form a trace with visi- 
ble peaks, called features, that correspond to one or more peptides. Because 
the sampling rates are high relative to the size of these features, each feature 
spans a range of mass-to-charge ratios and retention times. 

In nature, approximately 1% of all Carbon atoms are Carbon-13 (they 
contain an extra neutron). This leads to multiple features per peptide, each 
one containing a different, integer number of Carbon-13 atoms. These are 
collected into a single isotope group (IG) during preprocessing, and the in- 
tensity of this isotope group is estimated as the total volume under its 
associated features. In addition to multiple features from Carbon-13 substi- 
tution, a peptide may be present in the data set multiple times at different 
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charge states. These different charge states will have different mass to charge 
ratios and therefore result in multiple isotope groups per peptide. 

There are inherently two different types of correlation present in label- 
free, unbiased proteomics data. First, each isotope group originates from a 
particular protein and there are typically many isotope groups per protein 
in the data set — particularly for proteins that are highly abundant and/or 
of large molecular weight in the original sample. Second, some collections 
of proteins are expected to behave similarly because they are part of the 
same biological pathways. This will result in correlation between proteins 
(and therefore correlation between isotope groups) that are of distinct etiol- 
ogy. In general, distinct sources of correlation are confounding without some 
additional information allowing us to distinguish them. In the case of pro- 
teomics, there are techniques for identifying the specific amino acid sequence 
of a subset of the isotope groups that are present at relatively high concen- 
trations. These sequences are then associated to particular proteins through 
sequence alignment to proteins in a database (Nesvizhskii et al., 2003). We 
have then, for a limited subset of the isotope groups a (possibly imperfect) 
peptide sequence and originating protein, we call annotation. 

The proteomics data we will be focused on was obtained from 43 patients 
as part of the DARPA H1N1/H3N2 viral challenge project (Zaas et al., 
2009). From the entire pool, 24 patients were exposed to H1N1 and 17 
were exposed to H3N2. For each patient, four samples were taken at differ- 
ent reference time points, baseline (t = 0), the time of maximum symptoms 
(t = 1) as well as t = 0.2 and t = 0.8. Each subject was labeled as symp- 
tomatic (SX) or asymptotic (ASX) based on self-reported symptom scores, 
as well as viral culture. The samples of the H3N2 study were run in two 
batches with the initial pilot study containing only samples from time points 
t = {0,1} and the followup containing the t = {0.2,0.8} samples. In sum- 
mary, we have N = 172 samples from two studies (H1N1 and H3N2) divided 
in three batches (H1N1, H3N2i and H3N2 2 ), two conditions (SX and ASX) 
and where the data itself is a matrix of approximately 40, 000 IG expressions 
per sample. Peptide annotation was done using a combination of Mascot and 
PeptideProphet algorithms (Keller et al., 2002; Perkins et al., 1999). Isotope 
groups from the three batches were aligned using the algorithm described in 
Lucas et al. (2012). From all IGs, 13845 were successfully aligned across the 
H1N1 and H3N2 data sets. From the set of 4670 annotated IGs, only 1697 
had annotations in both data sets. The set of annotations consists of 239 
proteins from which only 106 are assigned to more than a single IG. The 
data has a relatively low overall missingness rate, most of them among low 
abundance IGs. However, missing values are unevenly distributed: H3N2i 
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having 10.3% missingness, H3N22 0.7% and H1N1 up to 2.5%. Two samples 
were removed from subsequent analysis because they had more than 30% 
missing values in the set of annotated IGs. 

3. Model definition. We model a sample n of batch m consisting of p 
IG expressions, x™, as an extended factor model separated into four effects, 
namely batch, systematic, protein expression and noise 

(3.1) x™ = /i m + Az n + Bw n + e n , 

where x™, fi m , z n , w ra and e n are p x 1 vectors. In particular, fj, m is the 
mean expression vector of batch m, factors z n = [z\ n . . . ZN Fn ] T are meant 
to capture systematic Np effects, w n is the expression level of Np proteins 
for sample n, A and B are p x Np and p x Np loading matrices for the 
systematic effects and protein expressions, respectively, and e n is measure- 
ment idiosyncratic noise. Systematic effects are included in the model for 
the sole purpose of cleaning the data as much as possible from batch effect 
specific and technical noise, with the aim to obtain protein profiles {w rt }, 
that better reflect true biology rather than technical variability. Provided 
that protein expression is not directly observed and because profile vectors 
[wki ■ ■ ■ WkN] are likely to be estimated from IGs that belong to multiple 
proteins, from now on we refer them as latent proteins. A-priori, we let each 
IG to be associated only to a single latent protein, say k, meaning that row 
of B contain just one non-zero entry. 

Identifiability in the model of equation (3.1) follows from well known re- 
sults for standard factor models and it should not be an issue for several 
reasons: (i) Confounding between systematic effects and metaproteins is very 
unlikely because A is essentially dense and B is actually a vector — assuming 
fixed IG-protein associations, (ii) w n does have a sign ambiguity that can 
be easily handled by letting B have non-negative entries, (iii) z n is can be 
identified up to scale and permutations if Np is small enough or for arbitrary 
Np as long as its distribution is non-Gaussian (see Kagan, Linnik and Rao, 
1973). Scale and permutation ambiguities are not of great concern here be- 
cause we are not interested on the interpretation of systematic effects. 

3.1. Prior specification. We need to specify prior distributions for each 
one of the elements in the right hand side of equation (3.1). Measurement 
noise is set to a zero- mean Gaussian with diagonal covariance matrix 
to allow for different noise variances for each IG. Entry specific priors for 
\I/ are set to flat inverse gamma distributions with shape t s = 1.1 and rate 
t r = 0.001, the former to keep the variance bounded away from zero. Mean 
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batch effects have Gaussian priors with mean t m = 8 and small precision 
t p = 0.01, set mainly based on the overall mean expression of the data. 

3.1.1. Systematic effects. We define systematic effect as a portion of vari- 
ability expressed in a large collection of isotope groups that cannot be clas- 
sified neither as non-specific measurement noise nor biological variability, 
meaning that it is more likely due to technical variability. These effects 
are usually characterized by high levels of correlation across many isotope 
groups, but potentially only in a subset of the samples (for example, only 
those in one batch). We capture the first part through the use of indepen- 
dent Gaussian priors on the elements of A, which allows systematic effects 
to span the entire set of isotope groups. Aiming to allow individual sam- 
ples to be largely dropped from specific systematic factors, we utilize in- 
dependent Laplace priors for the elements of z n . These are parameterized 
as scale mixtures of Gaussians with exponential mixing densities to facili- 
tate inference (Henao and Winther, 2011). We consider that the number of 
systematic factors Np is not critical because we are not concerned about 
the interpretability of matrix A. Besides, we have observed empirically that 
the variance explained by the systematic effect factors saturates quickly as 
Np increases. However, we decided to place an automatic relevance deter- 
mination (ARD) prior on A (Ncal, 1996). In particular, being a,j and Zj n 
elements of A, and z n , respectively, we have 

ctij ~ A/"(0, pj) , pj ~ Gamma(r r ,r s ), 

Zj n ~ A/"(0, Tj n ) , Tj n ~ Exponential(A 2 ) , A 2 ~ Gamma(4,^)i 

where pj is a shared factor- wise variance for the columns of A, r,- n is an 
auxiliary variance with exponential mixing so marginally, Zj n ~ Laplace(A 2 ) 
(Andrews and Mallows, 1974). We further place a gamma hyperprior on 
the rate of the Laplace distribution with parameters l s = 4 and £ r = 2. 
The mechanism behind an ARD prior is rather intuitive, for large values of 
Pj will correspond to small values of the j-th column of A, thus virtually 
switching off the entire effect. Setting r r = 1.1 and r s = 0.001 will encourage 
the desired behavior. In practice, the effective number of factors can be 
determined by thresholding pj or the elements of A column-wise. 

3.1.2. Latent protein profiles. We make two assumptions regarding iso- 
tope group expression. One is that each isotope group originates from only 
one latent protein and the other is that latent proteins may correlate with 
each other due to biological pathway activity. To model the first feature, we 
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set a prior hierarchy as follows 

bi, Ui \ui ~ jV+(0, 1), m\vi ~ Discrete (vj ) , Vj|a ~ Dirichlet(alAr p ) , 

where bij = if j ^ Ui, is the Gaussian distribution truncated below 

zero and where the i-ih IG is associated with the latent protein indexed 
by Ui with probability Vj. This means that the vector u serves as labeling 
variable for IGs. The conjugate prior for the vector of Np probabilities Vj, 
is set using a shared concentration parameter a. For the latter, we provide a 
flat gamma prior with parameters a s = 1 and a r = 1 (see Escobar and West , 
1995). 

We know that groups of proteins might have similar expression profiles 
for different reasons, for example because they are structurally similar, me- 
diate similar biological processes, share a pathway, etc. In order to capture 
this structure, we place a prior over binary trees on the Np latent pro- 
teins. This allows us to model correlation among metaproteins and leads to 
an interpretable representation of isotope groups, latent proteins and their 
interactions. Figure 1 illustrates the concept for a particular setting with 
p = 15 IGs distributed in Np = 5 proteins. We can see a hierarchical clus- 
tering structure in which for instance latent proteins w% and W2 are more 
similar than w<± and W5, thus more correlated. The pseudo time tj at which 
two nodes merge into Vj acts as similarity measure so that more alike latent 
proteins merge sooner in time, allowing us to directly quantify their pair- 
wise or group- wise similarities. The proposed hierarchy is an implementation 
of the Kingman's coalescent (Kingman, 1982a), and reflects the idea that 
isotope groups and latent proteins lay in different levels and that protein 
pathways are proxies for the average profiles of collections of proteins. 

Given a tree structure, {t, 7r}, where t is the vector of merging times and 
7r is the set of partitions at each level of the tree, we specify the relation- 
ship between node Vj and its parent node (or Wk at the leaves) through 
a multivariate Gaussian transition probability and set the following prior 
hierarchy 

(3.2) Vj\v k ,tj,t k ,& ~ jV(vfc, (tfc-t-,-)*), {t,7r} ~ Coalescent (N p ) , 

where Vj is a ^-dimensional row vector and $ is a covariance matrix encod- 
ing the correlation structure in v,-. In a nutshell, a coalescent prior selects 
a pair to merge uniformly from partition ttj and sets merging times with 
rate 1, this is i& ~ Exponential(l). With no further constraints, this prior 
distribution leads to a uniform prior distribution over trees — the coalescent, 
that is independent of merging times and that is infinitely exchangeable 
(Kingman, 1982a, b). Different priors for $ add flexibility to the model, for 
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isotope groups latent proteins protein pathways 

Fig 1 . Latent protein tree structure. Particular tree with Np — 5 and three isotope groups 
assigned to each latent protein. The pseudo time variable t defines the merging points. 

example in the i.i.d. case, a diagonal 3> with independent inverse gamma 
prior distributions on each diagonal element will accommodate for differing 
levels of noise for different samples. In cases where there is reason to suspect 
for correlation across samples, a different prior could be used. In our anal- 
yses for instance we use inverse Wishart priors to model correlation due to 
sample replicates and Gaussian process priors for smoothness in time series 
data. Inference for hierarchy in equation (3.2) is carried out using an efficient 
sequential Monte Carlo Sampler introduced by Henao and Lucas (2012). 

3.2. Inference. Model fitting is performed using Markov chain Monte 
Carlo (MCMC) to collect samples from the posterior of all parameters in 
the model, namely /x m , A, z n , B, w n , u, tt and The most relevant 
summaries involve posterior samples from the latent proteins, IG-protein 
assignments and the hierarchical structure encoded by the binary tree, it. 
Nearly all quantities of interest are updated using Gibbs sampling except for 
the tree components that require sequential Monte Carlo (SMC) sampling. 
In all the experiments we set the hyperparameters of the model to the values 
already mentioned unless otherwise stated. The upper bound for the number 
of factors is set to a conservatively large value, we have observed in practice 
that Np = [21og(p)J is large enough. For tasks with p and N in the lower 
thousands and hundreds, respectively, we can expect the inference routine to 
take less than a couple of hours in a desktop machine. The entire sampling 
sequence is fully described in Appendix A. 

Summaries for most of the important quantities of the model are com- 
puted in the usual way by means of histograms and empirical quantiles. Sum- 
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marizing trees on the other hand is not such an easy task because tree aver- 
aging is not a well defined operation. We could in principle use the pseudo 
time variable to build a pairwise distance matrix between latent proteins and 
then attempt to build a tree from a summary of such a similarity matrix. The 
problem being that we do not have any guarantee that this average of binary 
trees will produce binary tree as well. We tried this approach with both arti- 
ficial and real data, and found that the tree built using means or medians of 
the similarity matrices collected during inference oftentimes produced trees 
with non-binary branching, thus not matching the prior assumption. In view 
of this, we decided to select a single tree from all the available samples using 
as criterion the marginal likelihood of the tree. This is a common practice 
in tree based models, see for instance Teh, Daume III and Roy (2008) and 
Adams, Ghahramani and Jordan (2010). 

The source code and demo scripts for the model presented in this paper 
is written in MATLAB and C, and has been made publicly available at 
http : //www . duke . edu/~rhl37/lpt_v0 . 2 . tar. 

4. Artificial data. We begin with a set of experiments using artificially 
generated data in order to illustrate some of the features of our model and to 
perform some quantitative comparisons. We generated two datasets D\ and 
D 2 of sizes {p,N,N B ,N F ,N P } = {200,80,3,5,16} and {400,80,3,8,32}, 
respectively. Denoting the elements of fx m , A, B and * as fj,™, a^, bik and 
ipi, respectively, we draw N observations of the model from the following 
hierarchy 

x-~A/V\£), 

/4™ ~ A/"(8, 2) , m ~ Discrete^ 1 l Ng ) , 

aij ~ AT (0,0.1) , 
\m ~ A/+(0, 1) , Ui ~ Discrete (v) , 

V^ 1 ~ Gamma(l.l, 0.02) , v ~ Dirichlet(c»!) , 

S" 1 ~ Wishart(I, N P ) , a ~ Uniform(0.8, 2.4) , 

where S = AA T + BSB T + A is a p x Np matrix of systematic factor 
loadings, B is a p x Np matrix of latent protein loadings, S is the covariance 
matrix of the latent protein profiles and is the noise diagonal covariance 
matrix, as in equation (3.1). We generated 50 replicates of each dataset 
and uniformly flagged 20% of its values as missing. We run our sampler for 
2500 iterations, using the first 1500 as burn-in period. For this experiment, 
we set the distribution of the systematic factors to Gaussian, to match the 
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Table 1 

Structural measures for artificial data. Nf is selected with threshold Tj < 10 3 . Pairs in 
brackets are empirical 90% intervals across replicates. Best results in boldface letters. 



Set 


Method 


iV F 


Identity 


Confusion 


Di 
D 2 


LPT 

sLPT 
LPT 
sLPT 


5 (4,7) 
5 (4,7) 
8 (7,10) 

8 (5,10) 


0.94 (0.88,1.00) 

0.93 (0.81,1.00) 
0.97 (0.93,1.00) 

0.93 (0.86,0.97) 


0.01 (0.00,0.04) 

0.02 (0.00,0.06) 
0.01 (0.00,0.02) 

0.03 (0.01,0.06) 



assumption made in S. Since we are not introducing correlation across sam- 
ples, we set to diagonal with independent gamma priors. The average 
number of systematic factors is selected with threshold pj < 10 3 . We label 
each latent protein by tabulating the IGs associated to it from vector u and 
then picking the label having maximum count. We define identity as the 
percent of correctly labeled latent proteins, and confusion as the percent 
of variables incorrectly associated to their latent proteins. We compare our 
model (LPT) with (i) its simplified version without the tree structure infer- 
ence we call sLPT, thus without covariance structure in the latent profiles 
(Lucas et al., 2012). Table 1 shows results for the structural components the 
model, namely number of systematic factors, identity and confusion. Results 
evidence our models ability to capture the associations between IGs and la- 
tent protein profiles through u while properly handling systematic, batch 
effects and missingness in the data. 

We can also asses the performance of our model in terms of covariance ma- 
trix and missing values estimation. We compare LPT and sLPT with a sparse 
factor model as proposed by (sFM, Carvalho et al., 2008) but equipped with 
the same priors for missing values and batch effects used by our model. For 
sFM we set the number of factors to Np + Np = {21,24}, accordingly. 
In principle, the sparse model is flexible enough to estimate A and B but 
not S for the model assumes independent profiles, similar to sLPT. Table 2 
shows summaries of mean square error (mse), mean absolute error (mae) 
and maximum absolute bias (mab) across replicates for the methods under 
consideration. As seen in Table 2, our model performs better than the other 
two alternatives. 

The entire experiment was repeated for small variations in the hyperpa- 
rameters of the models and the artificial data generator without considerable 
changes in the performance measures. In general terms, we observed good 
mixing in the sampler from exploratory and standard diagnostic tests. We 
also repeated the experiment with correlation across samples and inverse 
Wishart distribution for the matrix with results similar to those shown 
in Tables 1 and 2. 
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Table 2 

Performance measures for artificial data. sLPT is the simplified LPT and sFM is a 
sparse factor model, MSE, MAE and 10 _1 xmab are mean squared error, mean absolute 
error and maximum absolute bias, respectively. Pairs in brackets are empirical 90% 



intervals. Best results shown in boldface letters. 


Set 


Measure 


LPT 


sLPT 




sFM 


Covariance 












MSE 


1.462 (0.832,3.057) 


7.297 (3.267,15.746) 


9.184 (4.669,17.976) 


Di 


MAE 


0.929 (0.702,1.378) 


1.994 (1.401,2.884) 


2.120 


(1.475,3.223) 




MAB 


0.606 (0.373,1.209) 


1.065 (0.704,2.036) 


1.558 


(0.982,2.474) 




MSE 


1.242 (0.864,1.800) 


4.141 (2.638,7.527) 


4.314 


(2.991,7.699) 


Di 


MAE 


0.869 (0.725,1.067) 


1.416 (1.174,1.857) 


1.400 


(1.137,1.849) 




MAB 


0.658 (0.444,1.078) 


1.079 (0.853,1.582) 


1.396 


(1.158,3.686) 


Missing values 












MSE 


0.212 (0.103,0.587) 


0.253 (0.098,0.526) 


1.604 


(0.908,2.481) 


Di 


MAE 


0.224 (0.199,0.298) 


0.235 (0.193,0.281) 


0.667 


(0.531,0.892) 




MAB 


0.776 (0.339,1.238) 


0.722 (0.409,1.271) 


0.956 


(0.780,1.372) 




MSE 


0.225 (0.144,0.464) 


0.260 (0.165,0.493) 


1.716 


(1.095,2.442) 


Di 


MAE 


0.227 (0.209,0.255) 


0.233 (0.214,0.267) 


0.734 


(0.601,0.910) 




MAB 


0.900 (0.628,1.636) 


0.929 (0.601,1.698) 


1.027 


(0.826,2.112) 



5. Spike-in data. The benchmark dataset originally introduced by 
Mueller et al. (2007) consists of 6 samples measured in three replicates. 
Each sample is a mixture of six non-human purified proteins in different 
concentration levels spanning two orders of magnitude from 25 to 800 fmol. 
Figure 2(a) show in dashed lines ground truth concentrations in log-scale 
and scaled between and 1. The raw data containing approximately 57000 
IGs was filtered down to 1841 IGs after identification, annotation and exclu- 
sion of IGs with 50% missing values or with less than 10% of the maximum 
variance IG. We only count with annotations for 88 IGs, this is 4.7% of the 
set. In particular, the final dataset contains 18 observations and 1841 IGs 
labeled as 7 proteins, namely, ADH1-Y (12), ALDOA-R (20), CAH2-B (13), CYC-H 
(24), LYSC-C (9), MYG-H (10) and UKN (1753), with the number of IGs per 
protein in parentheses and UKN denoting unannotated IGs. The data matrix 
has a missingness of 30% more or less evenly distributed across observations. 
Since the dataset is relatively clean and all the samples were obtained in a 
single session, we do not expect systematic nor batch effects. However, we do 
expect high correlation due to replicates, thus we provide $ with an inverse 
Wishart prior with 10 x N degrees of freedom and scale matrix composed 
of 6 blocks of magnitude 0.9 and size 3 plus 0.1 times the identity matrix. 
Although, learning the degrees of freedom and the blocks/diagonal propor- 
tions will be more principled, we did not observe substantial changes in the 
results from small changes in the previously mentioned values. We run the 
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MYG-H CAH2-B CYC-H LYSC-C ADH1-Y ALDOA-R 




(a) (b) 

Fig 2. Spike-in data profiles, (a) Ground truth (dashed) and estimated (solid) protein 
profiles scaled between and 1. Replicates are shown as squares and solid lines are averages 
across replicates, (b) Median IG expression grouped according to the labeling obtained 
during inference and averaged across replicates. Dashed lines correspond to original data 
with missing values and solid lines to data with missing values replaced by their estimates. 
Credible intervals were omitted for clarity. 



sampler for 4000 iterations with a burin-in period of 2000. 

Figure 2(a) shows the summary of the estimated latent protein profiles 
scaled between and 1. Each circle represents a replicate, solid lines are 
averages across replicates and dashed lines represent the scaled ground truth 
(see Mueller et al., 2007). Summaries were computed using medians and 
credible intervals were omitted for clarity. Compared to the ground truth, 
our model does a pretty good job at capturing the underlying profiles of 
all 6 proteins of interest despite of the large amount of missing values and 
unannotated IGs. 

Availability of the true protein profiles allows to quantitatively evaluate 
the performance of our model. We compare three different models in this 
case: (i) full i.i.d. latent proteins, meaning no tree structure prior; (ii) inde- 
pendent gamma distributions and diagonal assumes no correlation due 
to replicates and (iii) inverse Wishart prior for <1> with scale matrix as al- 
ready described. Results of model (ii) also appear in Henao et al. (2012). 
Although the three models produce profiles similar to those shown in Fig- 
ure 2(a), there are small differences. Table 3 indicates that in terms of mse, 
mae and mab the summaries of the model with the inverse Wishart prior are 
more accurate compare to the other two that do not account for correlation 
across samples. 

We can use the labeling vector u to examine how unannotated isotope 
groups were labeled after inference. In particular, ADH1-Y went from hav- 
ing 12 IGs to 118, ALDOA-R from 20 to 307, CAH2-B from 13 to 240, CYC-H 
from 24 to 288, LYSC-C from 9 to 189 and MYG from 10 to 185. Figure 2(b) 
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Table 3 

Performance measures for spike-in data, mse, mae and mab are mean squared error, 
mean absolute error and maximum absolute bias, respectively. 



Measure 


no tree 


tree with prior 






indep. gamma inverse Wishart 


lO^XMSE 


2.524 


1.899 1.661 


10 2 XMAE 


3.172 


2.983 2.494 


10 1 XMAB 


1.443 


1.252 1.213 



shows median IG expression grouped according to the labeling vector u and 
averaged across replicates to make easier comparisons against the ground 
truth in Figure 2(a). Dashed and solid lines correspond to data with and 
without missing values, respectively. For the latter, we have simply replaced 
the missing values with those estimated by our model. We see that for every 
protein, the inclusion of estimated missing values improves the expression 
average, mostly in the lower end of the expression range. We point out that 
a similar picture without using vector u but the labeling from annotation in- 
stead, does not resemble at all the ground truth. This is because the original 
labeling comprises 88 IGs with a considerable amount of missing values. 

6. H1N1/H3N2 viral challenge. We present now the case study 
based on the motivating data already described in Section 2. Here we will 
be using only the set of 4670 annotated IGs for which we have at least 2 
IG per protein. The model has thus sizes n = 172, Np = 3, Np = 16 and 
Np = 106. In this case, each observation can be seen as an element of a time 
series of length 4, t = {0, 0.2, 0.8, 1}. If we let latent proteins have Gaussian 
process priors with squared exponential covariance function and assuming 
no sample correlation across patients, we can compute the entries of from 

4>{h3) = c ij ex P y~J d ij) + 0-2 S v ' 

where £ is the inverse length scale, a 2 the idiosyncratic noise variance, 5ij = 
1 only if i = j, Cij = 1 only if i and j are from the same patient, and 
dij = U — tj is the time difference between pair {ti,tj} G {0,0.2,0.8,1}. 
Hyperparameters i and a 2 are updated using slice sampling (Neal, 2003). 
We run the inference procedure for 5000 burn-in iterations followed by 2000 
samples to compute summaries. The whole procedure takes approximately 
2.5 hours in a regular desktop machine with 4 cores. Mixing was monitored 
using both exploratory and standard diagnostic tests. Table 4 reports the 
resulting structural components of the model, namely previously described: 
number of systematic factors, Np, identity and confusion. Besides, we define 
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Table 4 

Structural measures for viral challenge data. Nf is selected with threshold r, < 10 3 and 

stability with threshold 0.6. 

Nf Identity Confusion Stability Unique 
3 0.774 0.511 0.958 0.783 



stability as the proportion of IGs having a single value in the label vector 
u for at least 60% of the MCMC samples after the burn-in period. We also 
define unique as the proportion of latent proteins with distinct labels. 

From Table 4 we see that nearly half of the IGs ended up with a pro- 
tein label different to annotation. In real data this is not so surprising as 
the chances of miss-annotation considerably increase due to systematic ef- 
fects, post-translational modifications, measurement error and alignment in- 
duced miss-labeling. As an example, consider the problem of aligning batches 
H1N1, N3N2i and N3N22- Initially, the three batches have different sets of 
notations that need to be matched to conform a common annotation set. We 
use the alignment algorithm described in (Lucas et al., 2012). From the 4670 
IGs included in the model, only 36% of them shared annotation, whereas for 
the remaining 64% IGs, annotation was transferred from one of the batches 
to the other two. This means that more than half of the IGs are more prone 
to miss-annotation due to alignment. From the set of 2220 IGs (47%) that 
kept the label from annotation nearly half of them are part of the set of 
IGs with H1N1/H3N2 shared annotation, suggesting that IGs annotated 
simultaneously in all sets tend to be more reliable than those labeled by 
label transfer. The identity of the model on the other hand, indicates that 
82 latent proteins match annotation when labeled by consensus of their IG 
members. Having 82 (78%) unique latent proteins implies that some proteins 
exist in multiple instances. For example, we count with 6 versions of APOB- 
H, which is in fact the most frequent protein in the set. Figure 3(a) shows 
the composition of all latent proteins. For each latent protein (column), 
we tabulate and sort the labels of its IG members. Darker colors represent 
proportions closer to 1. The first row is used to compute the consensus to 
determine identity. The red bar indicates wether the most frequent IG in 
a given latent protein is represented by less than 30% of the IGs assigned 
to it. The top bar shows in dark the 82 identified proteins. All in all, we 
see that for most latent proteins, the most frequent IG has an important 
contribution and that no latent protein has IGs from more than 17 different 
labels. Besides, we see correlation between the identified proteins and the 
proportion of their most frequent IG. 

We can also use latent proteins as predictors of the symptomatic vs. 
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Fig 3. Protein identification and status classification, (a) Latent protein sorted composi- 
tions, darker colors indicate proportions closer to 1. The red bar separates latent proteins 
in which the leading IG has proportion less than 30%. The top bar shows in dark the 
identified proteins, (b) Classification accuracy presented as AUC values estimated using 
leave-one-out cross-validation. Markers indicate median values and error bars cover 90% 
credible intervals. 



asymptomatic status of each observation in the dataset. For this purpose, we 
fit individual linear discriminant classifiers for each latent protein at each 
MCMC draw and estimate the classification accuracy as the area under 
the ROC curve (AUC, Receiver Operating Characteristic, Fawcett, 2006). 
Figure 3(b) shows results for six of the most discriminant latent proteins: 
FHR1-H, ZPI-H, CRP-H, LBP-H, A2GL-H and C09-H, it shows in particular that 
FHR1-H has an overall decent performance. In addition, when treating H1N1 
and H3N2 as separate classification tasks, we observe that H3N2 is somehow 
easier to classify as can also be seen quite clearly from Figure 3(b). 

As described in Section 3, the prior distribution for the set of latent 
proteins allows us to build a binary tree representation of its elements in 
a hierarchical clustering fashion. When examining the resulting structure 
(see Supplement A) we found some straightforward groupings in the tree 
mostly corresponding to protein variants like APOC2-H and APOC3-H, C08A- 
H, C08B-H and C08G-H, FIBG-H and FIBB-H, F13A-H and F13B-H, etc, all of them 
having similar profiles when looking at their estimated signatures (results 
not shown). In other cases, for instance C04(a,b)-H and APOB-H, showing great 
diversity in their profiles and as a result rather spread in the structure. 

Figure 4 shows the subtree corresponding to 4 of the discriminant pro- 
teins from Figure 3(b) along with a scatter of the expression values of each 
latent protein. Each panel in the figure shows expression in the y-axis and 
data grouping in the x-axis. Data to the left hand side of the dashed ver- 
tical line corresponds to the asymptotic set whereas the other side contain 
symptomatic observations. Each side is further grouped according to time, 
so points closer to the dashed vertical line are for t = (green), then t = 0.2 
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Fig 4. Discriminant subtree. Each node is represented as a scatter of the samples from 
study H3N2 only. The dotted line separates samples labeled asymptomatic (left) and symp- 
tomatic (right). The x-axis groups samples according to time: red for t = 0, yellow for 
t = 0.2, red for t — 0.8 and purple for t = 1. The y-axis is the estimated latent/protein 
pathway expression. Solid lines connect group means. Note that time points t = 0.8 and 
t = 1 are nicely separated. 



(yellow), t = 0.8 (red) and the farthest to t = 1 (purple). The good separa- 
tion of observations from times t = {0.8, 1} is the main responsible for the 
classification results shown in Figure 3(b) 

It should be noted that the DARPA study collected samples from multiple 
other sources, and that there is published, publicly available gene expression 
data from the peripheral blood of the same patients we have examined here. 
That data is analyzed in Zaas et al. (2009) and a time course trajectory 
model is developed on a more complete version of the data in Chen et al. 
(2011). Together with the proteomics data included in the supplementary 
material of this paper (see Supplement B), these offer interesting possibilities 
for future work into jointly modeling proteomics and gene expression data. 
We have briefly examined correlation between protein and matched gene ex- 
pression in these data sets, but find that it is generally quite low. However, 
a brief examination of the top genes discovered in Zaas et al. (2009) and the 
five discriminative proteins elucidated here shows a high overlap in associ- 
ated pathways. We suspect that these results, and in general joint analysis 
of these data is complicated by the tissue of origin. Specifically, it is not clear 
that the proteins in blood plasma originate from peripheral blood mononu- 
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clear cells (from which there is published gene expression data). Instead, it 
is likely that much of the observed protein expression is due to activities in 
organs such as the liver or kidneys and from the endothelial lining of blood 
vessels. 

7. Concluding remarks. We have presented a factor model specif- 
ically designed for proteomics data analysis. It successfully handles broad 
scale variability that is known to come from technical sources (such as batch 
effects and isotope group specific noise) hence enabling us to estimate latent 
protein profiles that better describe biological variability. Our hierarchical 
representation of isotope groups, latent proteins and protein pathways pro- 
vide us with detailed annotation uncertainty assessment, detection of possi- 
bly inaccurately annotated isotope groups and post-translationally modified 
proteins and clustering of proteins with similar expression profiles that re- 
flect biologically related interactions. We have also shown that features of 
our model can be used to define predictive models based either on latent 
proteins or groups of latent proteins. 



We describe next the MCMC analysis mostly based on Gibbs sampling. 
We provide then the relevant conditional posteriors and SMC based update 
details for the tree structure. To simplify notation, we use the following 
shorthands. Let X m = [xf ■■■ x%J and X = [X 1 ••• X Wb ], where N B 
is the number of batches, N m is the number of samples in batch m and 
N = J2m=i Nm- Define 1^ to be a k— dimensional row vector of ones and 
let X. be the full data set with the appropriate means subtracted off, this 
is X = [X 1 - ^Ijvj ••• X* - fi N *l K ], and Z = [zi ••• ztv] and 
W = [wi • • • wat], systematic factors and latent protein matrices of sizes 
Np x N and Np x N, respectively. For any matrix M, define Mj : as its i-th 
row and M :J - to be its j-th column. 

Noise variance. Sample each element of the diagonal of * from 

ip^ 1 \t s ,t r ~ Gamma (t s + y , t r + c) , 
where t s and t r are respectively prior shape and rate and 



APPENDIX A: MCMC INFERENCE DETAILS 



c = |(Xi: - A i: Z - B i; W)(X i; - A l: Z - B i: W) 



Batch means. Sample mean vector for batch m from 



fi m \t m , t p ~ M C t m t p + * - 1 K ~ Az n - Bw ; 




n=l 
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where C = (t p + N m ^ ) , t m and t p are prior mean an precision, respec- 
tively. 

Systematic effect factors. The conditional posterior of Z, using a scale mix- 
tures of Gaussians representation, can be computed independently for each 
element of the matrix using 

T lT ,-l 



Zjn\T~jn ~ A/" yCjnA-.jty €\j n ,Cj n 

where c jn = (A^* _1 A^+r^ 1 )" 1 and e\ jn = x n - Az n -Bw n - n m \z jn = 0. 
The mixing variances Tj n are exponentially distributed with rate A 2 , hence 
the resulting conditional posterior is 



T ■ A ~ 



IG (\/l^' A2 ) ' A2|4 '^~ Gamma ^ + i4 + |j73 B | , 



where £ s and i r are shape and rate priors, respectively. IG(-|/i, A) is the 
inverse Gaussian distribution with mean /i and scale A (Chhikara and Folks, 
1989). Each element a^- from the loading matrix A is sampled from 

~ M (cije\ijZj.,Cijil)i \ , 

where Cjj = (Zj : Zj + ipipj)' 1 and = Xj : — Aj : Z — Bj : W|ajj = 0. Then, 
column- wise precisions for A are drawn from 

Pj\r s ,r r ~ Gamma ^r s + |, r r + ^ a? , 

where r s and r r are prior shape and rate, respectively. 

Protein profiles. The conditional posterior for latent proteins W can be 
updated from 

W fc: |v fc ~ M(CBI*-\X- AZ) + CS- 1 m fc ,c) , 



where C = (B^Sf^B^ + Sj~ with and being mean and covari- 
ance of the parent profile of . Note that = for all isotope groups 
not assumed to be part of this protein, and that these will not contribute 
to the update distribution for W^.. Besides, 

kklhk + ~ Af+ fc(Xi: - A i: Z)Wl,c^) , 
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where c = (Wfc : Wj. + ipi) 1 and A/+(-) is the Gaussian distribution trun- 
cated below zero. Now we can sample IG-latent protein assignments from 

Ui\a,K,t s ,t r ~ Discrete (vj) , 



where re& is the number of non-zero entries in column k of B, c = W^W^,. 
and Vk is the k-th. element of Vj. 

Protein pathway expression and tree structure. We sample the tree struc- 
ture components t, 7r and 3>, and the means and covariances of each internal 
node of the tree, and respectively, using the SMC sampler described 
in Henao and Lucas (2012). In particular, {t,7r} are obtained for a number 
M of particles, as a leaves to root SMC pass, together with partial updates 
of the node parameters {m/%, S^}. Next we use particle's weights to sample 
a single configuration. The procedure is completed by resampling the hy- 
perparameters of the covariance function and by completing the updates of 
the node parameters using the selected configuration, the latter as a root to 
leaves pass. 

Missing values. For each missing value x 7 ^ corresponding to isotope group 
i, sample n and batch m, we simply use independent standardized Gaussian 
prior distributions. 

Initialization. We start the model from maximum likelihood estimates of 
the less critical quantities, this is batch means {/i rn } m f :1 and noise variances 
Systematic factors Z and latent proteins W are initialized using stan- 
dardized Gaussian distributions. The loading matrices A and B (non-zero 
elements only) were set to ordinary least squares estimates based upon al- 
ready set Z and W, respectively. The vector of associations u was set with 
the information obtained from annotation about IG-protein assignments. 



Supplement A: Tree structure (http://lib.stat.cmu.edu/aoas/?). Fig- 
ure showing the resulting tree structure for the H1N1 /H3N2 viral challenge 
data. 

Supplement B: Data (http://lib.stat.cmu.edu/aoas/?). H1N1/H3N2 
viral challenge raw data 
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